- GPUs Explained - Branch Education.
- Cool.
- From smallest to largest.
Thread
- Has its own registers and private variables.
Warp / Wave
- Fixed-size group of threads executed together in lockstep.
- These are the hardware scheduling units: the smallest batch of threads that can be executed together.
- A warp scheduler in the SM/CU issues instructions from one warp/wave at a time to its execution units.
- The SM does not necessarily execute a whole warp in a single cycle; it issues instructions for a warp and can interleave instructions from multiple warps to hide latency.
- The warp is the smallest scheduling/issue granularity, but instruction dispatch and the set of active lanes depend on the pipeline and issue width.
- Reconvergence is implemented by hardware (and compiler) mechanisms; divergent threads are masked and the SM executes each path serially until reconvergence.
- Wavefront (AMD) or warp (NVIDIA).
- Common sizes:
  - NVIDIA warp: 32 threads.
  - AMD wavefront: 64 threads on GCN (RDNA hardware also supports a native 32-wide wave mode).
- These threads share a program counter and execute the same instruction at the same time (SIMT model).
- If threads diverge in control flow, the hardware masks off threads not taking the current branch until they reconverge (see the sketch below).
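A minimal GLSL compute-shader sketch of divergence (the buffer name is hypothetical, not from the video): when lanes of the same warp/wave disagree on the condition, the hardware runs each branch serially with the non-participating lanes masked off, then reconverges.

```glsl
#version 450
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Data { float values[]; };  // hypothetical buffer

void main() {
    uint i = gl_GlobalInvocationID.x;
    // Adjacent lanes take different branches here, so the warp executes
    // the "if" path with odd lanes masked off, then the "else" path with
    // even lanes masked off, and reconverges after the branch.
    if ((i & 1u) == 0u) {
        values[i] = values[i] * 2.0;   // even lanes
    } else {
        values[i] = values[i] + 1.0;   // odd lanes
    }
}
```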
Workgroup (API abstraction)
- Defined by you in the compute shader source (Vulkan GLSL/HLSL).
- Group of threads that can share shared memory within a single SM/CU.
- A workgroup is scheduled to a single SM/CU while it is active, but over time it may be re-scheduled onto a different SM (e.g., after preemption or a context switch).
- Practically, code semantics assume the workgroup runs on a single SM until completion.
- Size: arbitrary (within hardware limits), e.g., local_size_x = 256.
- Purpose: a group of threads that:
  - Can share shared memory / LDS.
  - Can synchronize using barrier() calls.
- Hardware: the entire workgroup runs within one SM/CU (so its threads can share that SM's on-chip memory).
- Example:
  - If you set local_size_x = 256, that's 256 threads in the workgroup (see the sketch below).
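A minimal sketch of a 256-thread workgroup cooperating through shared memory and barrier(); the buffer names and the reduction itself are illustrative, not from the video.

```glsl
#version 450
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InputBuf  { float inputs[]; };  // hypothetical
layout(std430, binding = 1) buffer OutputBuf { float sums[]; };    // hypothetical

shared float tile[256];  // lives in the SM's shared memory / LDS

void main() {
    uint lid = gl_LocalInvocationID.x;
    tile[lid] = inputs[gl_GlobalInvocationID.x];
    barrier();  // all 256 threads of the workgroup wait here

    // Tree reduction over the workgroup's shared tile.
    for (uint stride = 128u; stride > 0u; stride >>= 1u) {
        if (lid < stride) {
            tile[lid] += tile[lid + stride];
        }
        barrier();
    }
    if (lid == 0u) {
        sums[gl_WorkGroupID.x] = tile[0];  // one result per workgroup
    }
}
```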
Subgroup (API Abstraction)
- Subset of threads in a workgroup that maps to a warp/wave (e.g., 32 threads).
- Enables warp-level operations (shuffles, reductions) without shared memory (see the sketch below).
- Exposed in Vulkan/OpenCL; size is queried at runtime.
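A minimal sketch of a subgroup-level reduction using the Vulkan GLSL subgroup extensions; buffer names are hypothetical.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InputBuf  { float inputs[]; };    // hypothetical
layout(std430, binding = 1) buffer PartialBuf { float partials[]; }; // hypothetical

void main() {
    float v = inputs[gl_GlobalInvocationID.x];

    // Reduce across the warp/wave without touching shared memory.
    float sum = subgroupAdd(v);

    // One lane per subgroup writes the partial result.
    if (subgroupElect()) {
        uint idx = gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID;
        partials[idx] = sum;
    }
}
```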
SM (Streaming Multiprocessor) / CU (Compute Unit)
- SM = Streaming Multiprocessor (NVIDIA terminology).
- CU = Compute Unit (AMD terminology).
- Hardware block that runs multiple warps/waves.
- This is the fundamental hardware block that executes shader threads.
- Has per-SM caches and shared memory:
  - Registers (not shared between SMs).
  - Shared memory (LDS).
  - L1 cache (per-SM).
  - Implication: if one SM's L1 cache holds certain data, another SM won't see it; coherence happens at L2 (see the sketch below).
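A loose GLSL sketch of that visibility consequence (the buffer name is hypothetical): because L1 is per-SM, cross-workgroup visibility of writes is only well-defined when the buffer is declared coherent and a buffer memory barrier is issued, which pushes visibility out to the level shared between SMs (L2).

```glsl
#version 450
layout(local_size_x = 64) in;

// Hypothetical flag buffer; "coherent" makes writes visible to
// invocations in other workgroups, which may be running on other SMs.
layout(std430, binding = 0) coherent buffer Flags { uint flags[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;
    flags[i] = 1u;
    // Make the store visible beyond this SM's L1 before anything else
    // in this invocation relies on other observers having seen it.
    memoryBarrierBuffer();
}
```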
- Resource Partitioning:
  - The register file is partitioned among resident threads (up to 255 registers/thread on NVIDIA Ampere).
  - Shared memory is configurable (e.g., 64–164 KB per SM on NVIDIA).
- Concurrent Execution:
  - Runs multiple warps/waves simultaneously (e.g., up to 64 resident warps per SM on NVIDIA).
  - Hides latency via zero-cost warp switching (see the worked example below).
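  - Worked example (a rough sketch, assuming a 64 K-entry register file per SM, typical of recent NVIDIA parts): a kernel that needs 64 registers per thread can keep at most 65,536 / 64 = 1,024 threads (32 warps of 32) resident, so heavy register or shared-memory use shrinks the pool of warps available for latency hiding.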
- Each SM/CU has:
  - Its own set of registers (private to threads assigned to it).
  - Its own shared memory / LDS (Local Data Store), accessible to all threads in a workgroup.
  - Access to L1 cache and special-function units (SFUs).
- Vulkan equivalent: a workgroup in a compute shader runs entirely within one SM/CU.
TPC/GPC (Texture Processing Cluster / Graphics Processing Cluster) / SA (Shader Array)
- A group of SMs/CUs that may share intermediate caches or specialized hardware.
- NVIDIA:
  - "Texture Processing Cluster" (TPC) or "Graphics Processing Cluster" (GPC).
  - A TPC groups a couple of SMs with texture units; a GPC groups several TPCs and shares raster/tessellation hardware.
- AMD:
  - "Shader Array" (SA) in RDNA.
  - Uses "Shader Array" or "Workgroup Processor" as a cluster-like grouping.
  - A Workgroup Processor (WGP) groups 2 CUs sharing an instruction cache and LDS (RDNA 2+ also pairs each CU with a ray accelerator).
- Shared at cluster level: sometimes texture units, geometry units, or a shared instruction cache.
GPU Die
- Graphics Engine (e.g., AMD's Shader Engine, NVIDIA's GPC).
- Contains multiple clusters + fixed-function units (geometry, raster).
GPU
- All clusters together, sharing the L2 cache and global memory.